set.seed(12345678, sample.kind = "Rounding")
series_picker <- function(id) {
  # Note: the seed is fixed inside the function, so the same series is drawn
  # regardless of the id argument passed in.
  set.seed(12345678, sample.kind = "Rounding")
  myseries <- aus_retail %>%
    filter(
      `Series ID` == sample(aus_retail$`Series ID`, 1),
      Month < yearmonth("2018 Jan")
    )
}
my_data <- series_picker(28710967) %>% select(Month, Turnover)
training_set <- my_data[1:405,]
test_set <- my_data[406:429,]
latest_data <- readxl::read_excel("8501011.xlsx", sheet=2, skip=9) %>%
rename(Month = `Series ID`) %>%
gather(`Series ID`, Turnover, -Month) %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index=Month,key=`Series ID`) %>%
filter(`Series ID` == "A3349365F", year(Month) > 2017) %>%
select(-`Series ID`)
After generating the series with the function above, I split the data into a training set, which will be used to fit the models, and a test set, which will be used to pick the better-performing model.
The training set contains 405 observations, from April 1982 to December 2015, while the test set covers 2016 and 2017, 24 observations in total.
Meanwhile, for the final part we also need the latest data from the ABS (January 2018 to March 2019), which we store as “latest_data.”
#my_data %>% summary()
my_data %>% glimpse()
## Rows: 429
## Columns: 2
## $ Month <mth> 1982 Apr, 1982 May, 1982 Jun, 1982 Jul, 1982 Aug, 1982 Sep, 1…
## $ Turnover <dbl> 18.4, 18.3, 17.7, 18.1, 17.7, 18.5, 20.3, 20.8, 25.7, 19.6, 1…
my_data %>% autoplot(Turnover) +
xlab("Time (month)") +
ylab("Turnover (Million AUD)") +
ggtitle("WA Food's Retail")
my_data %>% gg_tsdisplay(Turnover)+
ggtitle("Time series Description")
my_data %>% gg_subseries(Turnover)+
ggtitle("Subseries plot to inspect seasonality")
my_data %>% gg_season(Turnover)+
ggtitle("Season plot to inspect seasonality")
#STL
my_decp <- my_data %>%
model(STL(Turnover ~ season(window = "periodic")) ) %>%
components()
my_decp %>% autoplot()
my_decp %>% gg_subseries(season_year)
The data set in this project is Western Australia’s cafe, restaurant, and takeaway food services retail turnover, from April 1982 to December 2017, 429 observations in total. Turnover (the retail amount) is measured in millions of Australian dollars: \(\text{Turnover} \in [17.7, 521.3]\), with a mean of $A184.2M.
From the plot, turnover shows a very clear increasing trend, which slows (dampens) between 2012 and 2016 before growth resumes. The ACF plot shows very strong autocorrelation, so the series is non-stationary (or at best trend-stationary).
From the seasonal perspective, turnover peaks each December, probably due to the Christmas holiday, and then bottoms out the following February, so a seasonal pattern exists.
For further detail, I also use STL decomposition to investigate the seasonal pattern in depth: the series bottoms out in February, climbs gradually, and finally peaks in December.
Since the variance of the series is not constant (as the plot shows), a transformation is necessary. And, from the discussion above, the series is non-stationary, so its mean needs to be stabilised as well.
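As a quick sketch of what the transformation below actually computes, here is the Box-Cox formula written out in base R; the test values are illustrative, not taken from the report:

```r
# Box-Cox transformation: log(y) when lambda = 0, else (y^lambda - 1) / lambda.
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox_manual(100, 1)    # lambda = 1 only shifts the series: 99
box_cox_manual(100, 0)    # the log transform: log(100)
box_cox_manual(100, 0.3)  # the lambda used in this project
```

Small lambda compresses large values more strongly, which is why it tames the growing seasonal variance.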
my_data %>% autoplot(log(Turnover))
my_data %>% features(Turnover, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.309
my_data %>% autoplot(
box_cox(Turnover, 0.3)
) + xlab("Time (Month)")+
ylab("Turnover (After Box Cox transformation)") +
ggtitle("WA food retail (after transformation)")
From the plot, the log transformation slightly over-corrects the variance; part of the series now looks convex.
Using the Guerrero method, the algorithm reports \(\lambda = 0.309\).
For simplicity, this project uses \(\lambda = 0.3\) to transform the data; the result is shown in the graph above.
my_data %>% gg_tsdisplay(box_cox(Turnover, 0.3))
my_data %>% gg_tsdisplay(difference(box_cox(Turnover, 0.3)))+
ggtitle("First difference") +
ylab("Difference y")
To stabilise the series, differencing is necessary. After a first-order difference the data looks stationary, but the ACF plot indicates a seasonal pattern remains, with spikes recurring at lag 12 and again at lag 24.
my_data %>% gg_tsdisplay(difference(box_cox(Turnover, 0.3), lag = 12))+
ggtitle("Seasonal Difference") +
ylab("Difference y")
If only the seasonal difference is applied, the series is still non-stationary, so an additional first difference is required.
my_data %>% gg_tsdisplay(difference(box_cox(Turnover, 0.3), lag = 12) %>%
difference()) +
ggtitle("Two Steps Difference") +
ylab("Difference y")
After both differences, the series appears stationary.
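The two differencing steps above can be sketched with base R's `diff()`, using the built-in monthly `AirPassengers` series as a stand-in for the retail data:

```r
y <- log(AirPassengers)  # 144 monthly observations

d_seasonal <- diff(y, lag = 12)          # seasonal difference: y_t - y_{t-12}
d_both     <- diff(d_seasonal, lag = 1)  # then an ordinary first difference

length(d_seasonal)  # 12 observations are lost to the seasonal difference
length(d_both)      # one more is lost to the first difference
```

Each difference shortens the series by its lag, which is why the differenced plots above start later than the original.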
my_data %>%
mutate(trans = box_cox(Turnover, 0.3)) %>%
features(trans, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 7.00 0.01
my_data %>%
mutate(fir_diff =
difference( box_cox(Turnover, 0.3) )
) %>%
features(fir_diff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0245 0.1
my_data %>%
mutate(fir_diff =
difference( box_cox(Turnover, 0.3) , lag = 12)
) %>%
features(fir_diff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.186 0.1
my_data %>%
mutate(fist_diff = difference(box_cox(Turnover, 0.3)) %>% difference()) %>%
features(fist_diff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.00792 0.1
The KPSS (unit root) test shows a very small p-value (0.01) for the undifferenced data, indicating that a unit root is present.
But after differencing (whether a first difference, a seasonal difference, or both), the tests no longer reject stationarity, so we can fit models to the differenced data without concern.
After both differences, the ACF plot shows a spike at lag 1, suggesting a non-seasonal MA(1), and a spike at lag 12, suggesting a seasonal MA(1).
In the PACF plot, lags 1 and 2 both cross the lower confidence bound, suggesting a non-seasonal AR(2); the same pattern observed at lags 12 and 13 suggests a seasonal AR(2).
So the potential models are:
ARIMA(0,1,1)(0,1,1)[12]
ARIMA(2,1,0)(2,1,0)[12]
I also plan to fit a complete ARIMA model with both the AR part and the MA part like:
ARIMA(2,1,1)(2,1,1)[12]
Clearly, from the plot, the data shows no quadratic trend, and since the models already include two orders of differencing, no constant is needed in any of them.
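The ACF signature used in the identification above can be checked on a simulated series (an illustration, not the report's data): an MA(1) process has a single significant ACF spike at lag 1 and essentially nothing afterwards.

```r
# Simulate a long MA(1) with theta = 0.8 and inspect its sample ACF.
set.seed(42)
x   <- arima.sim(model = list(ma = 0.8), n = 5000)
rho <- acf(x, lag.max = 3, plot = FALSE)$acf

# Theoretical ACF of an MA(1): rho(1) = theta / (1 + theta^2) ~ 0.49,
# and rho(k) = 0 for k >= 2.
round(rho[2], 2)  # lag 1: near 0.49
round(rho[3], 2)  # lag 2: near 0
```

The same logic at lag 12 is what motivates reading the seasonal spike as a seasonal MA(1).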
I also ask the algorithm to automatically pick a model.
arima_fit <- training_set %>% model(
ar = ARIMA(box_cox(Turnover, 0.3) ~ pdq(2,1,0) + PDQ(2,1,0)),
ma = ARIMA(box_cox(Turnover, 0.3) ~ pdq(0,1,1) + PDQ(0,1,1)),
arima = ARIMA(box_cox(Turnover,0.3) ~ pdq(2,1,1) + PDQ(2,1,1)),
auto = ARIMA(box_cox(Turnover, 0.3), stepwise = FALSE)
)
arima_fit %>% select("auto") %>%
report()
## Series: Turnover
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift
## Transformation: box_cox(Turnover, 0.3)
##
## Coefficients:
## ar1 ma1 sma1 constant
## 0.9723 -0.2226 -0.7807 0.0102
## s.e. 0.0127 0.0529 0.0358 0.0015
##
## sigma^2 estimated as 0.03034: log likelihood=125.11
## AIC=-240.22 AICc=-240.07 BIC=-220.35
The automatically selected model is ARIMA(1,0,1)(0,1,1)[12].
I select among the candidates with the following procedure:
arima_fit %>% glance()
## # A tibble: 4 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ar 0.0347 102. -194. -194. -174. <cpl [26]> <cpl [0]>
## 2 ma 0.0306 122. -238. -238. -226. <cpl [0]> <cpl [13]>
## 3 arima 0.0307 124. -233. -233. -205. <cpl [26]> <cpl [13]>
## 4 auto 0.0303 125. -240. -240. -220. <cpl [1]> <cpl [13]>
for_arima <- arima_fit %>% forecast(h = "2 years") %>%
select(.model, Month, Turnover, .mean)
for_arima %>%
autoplot(my_data %>% filter(year(Month) > 2012)) +
facet_wrap(~.model)+
xlab("Time")+
ggtitle("Forecasting results of four ARIMA models")
for_arima %>% accuracy(test_set)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ar Test 36.2 42.9 36.3 7.82 7.84 NaN NaN 0.731
## 2 arima Test 9.64 14.2 12.0 2.06 2.62 NaN NaN 0.447
## 3 auto Test 0.502 7.29 6.07 0.0919 1.35 NaN NaN 0.0391
## 4 ma Test 14.6 18.3 15.5 3.16 3.36 NaN NaN 0.475
Next I look for the best model in the ETS family. From the earlier plots, the data follows an upward, roughly linear trend, with slight damping after 2012; since the growth resumes afterwards, a damped trend term is not considered. After the Box-Cox transformation the seasonal variation is fairly stable, so an additive seasonal term should be appropriate.
The candidate models are as follows:
ETS(M,A,A)
ETS(M,M,A)
ETS(A,A,A)
ETS(A,M,A)
An automatically selected model is also considered.
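As a base-R sketch of the additive error/trend/season structure being fitted below, `stats::HoltWinters` is the classical analogue of the ETS(A,A,A) state-space model; `AirPassengers` with a log transform stands in for the Box-Cox-transformed retail series (both are assumptions for illustration):

```r
# Additive Holt-Winters: level, trend, and 12 monthly seasonal states,
# each updated by one smoothing parameter.
hw <- HoltWinters(log(AirPassengers))  # additive trend and season by default

c(alpha = hw$alpha, beta = hw$beta, gamma = hw$gamma)  # the 3 smoothing parameters
p <- predict(hw, n.ahead = 24)  # a two-year forecast, as in this project
```

ETS(A,A,A) estimates these same quantities within a likelihood framework, which is what allows the AIC comparison used later.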
ets_fit <- training_set %>%
model(
MAA = ETS(box_cox(Turnover, 0.3) ~ error("M") + trend("A") + season("A")),
MMA = ETS(box_cox(Turnover, 0.3) ~ error("M") + trend("M") + season("A")),
AAA = ETS(box_cox(Turnover, 0.3) ~ error("A") + trend("A") + season("A")),
AMA = ETS(box_cox(Turnover, 0.3) ~ error("A") + trend("M") + season("A")),
auto = ETS(box_cox(Turnover, 0.3))
)
ets_fit %>% select("auto") %>%
report()
## Series: Turnover
## Model: ETS(A,A,A)
## Transformation: box_cox(Turnover, 0.3)
## Smoothing parameters:
## alpha = 0.7088371
## beta = 0.0001000037
## gamma = 0.168638
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4]
## 4.739423 0.03006481 0.1040377 -0.1032187 0.07916982 0.6146921 -0.003378355
## s[-5] s[-6] s[-7] s[-8] s[-9] s[-10]
## -0.01139191 -0.03519888 -0.1074575 -0.124238 -0.2420058 -0.09787209
## s[-11]
## -0.07313831
##
## sigma^2: 0.0314
##
## AIC AICc BIC
## 1047.284 1048.865 1115.350
The algorithm picks an ETS(A,A,A) model, which is already on the list.
The same selection procedure as for ARIMA is applied to pick the winner.
#ets_fit %>% select("AAA") %>%
# forecast(h = "2 years") %>%
# autoplot(my_data %>% filter(year(Month) > 2012))
ets_fit %>% glance()
## # A tibble: 5 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MAA 0.000301 -518. 1071. 1072. 1139. 0.0313 0.0488 0.0131
## 2 MMA 0.000292 -513. 1059. 1061. 1127. 0.0313 0.0510 0.0129
## 3 AAA 0.0314 -507. 1047. 1049. 1115. 0.0301 0.0479 0.137
## 4 AMA 0.0319 -510. 1054. 1055. 1122. 0.0306 0.0491 0.138
## 5 auto 0.0314 -507. 1047. 1049. 1115. 0.0301 0.0479 0.137
for_ets <-ets_fit %>% forecast(h = "2 years")
#for_ets %>%
# autoplot(my_data %>% filter(year(Month) > 2012)) +
# facet_wrap(~.model)+
# xlab("Time")+
# ggtitle("Forecasting results of four ETS models")
for_ets %>% accuracy(test_set)
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA Test 1.05 10.4 8.44 0.140 1.88 NaN NaN 0.404
## 2 AMA Test -10.4 14.0 11.3 -2.35 2.54 NaN NaN 0.276
## 3 auto Test 1.05 10.4 8.44 0.140 1.88 NaN NaN 0.404
## 4 MAA Test -5.19 10.6 8.36 -1.23 1.90 NaN NaN 0.321
## 5 MMA Test -7.80 11.9 9.45 -1.79 2.15 NaN NaN 0.318
ETS_AAA <- ets_fit %>% select("AAA")
ARIMA <- arima_fit %>% select("arima")
report(ETS_AAA)
## Series: Turnover
## Model: ETS(A,A,A)
## Transformation: box_cox(Turnover, 0.3)
## Smoothing parameters:
## alpha = 0.7088371
## beta = 0.0001000037
## gamma = 0.168638
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4]
## 4.739423 0.03006481 0.1040377 -0.1032187 0.07916982 0.6146921 -0.003378355
## s[-5] s[-6] s[-7] s[-8] s[-9] s[-10]
## -0.01139191 -0.03519888 -0.1074575 -0.124238 -0.2420058 -0.09787209
## s[-11]
## -0.07313831
##
## sigma^2: 0.0314
##
## AIC AICc BIC
## 1047.284 1048.865 1115.350
report(ARIMA)
## Series: Turnover
## Model: ARIMA(2,1,1)(2,1,1)[12]
## Transformation: box_cox(Turnover, 0.3)
##
## Coefficients:
## ar1 ar2 ma1 sar1 sar2 sma1
## 0.4060 0.0619 -0.6388 0.0359 -0.0656 -0.7755
## s.e. 0.3706 0.1177 0.3658 0.0780 0.0659 0.0595
##
## sigma^2 estimated as 0.03073: log likelihood=123.61
## AIC=-233.21 AICc=-232.92 BIC=-205.41
The ETS(A,A,A) model needs to estimate three smoothing parameters and to specify 14 starting values.
The ARIMA model has six coefficients to estimate.
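The parameter counts above come from simple arithmetic, which can be spelled out (illustrative, matching the model reports printed earlier):

```r
# ARIMA(2,1,1)(2,1,1)[12] with no constant: p + q + P + Q coefficients.
arima_params <- 2 + 1 + 2 + 1  # ar1, ar2, ma1, sar1, sar2, sma1

# ETS(A,A,A) on monthly data: alpha, beta, gamma, plus the initial
# level, trend, and 12 seasonal states.
ets_smoothing <- 3
ets_states    <- 1 + 1 + 12

c(arima = arima_params, ets_starts = ets_states)
```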
ets_for <- ETS_AAA %>% forecast( h = "2 years")
arima_for <- ARIMA %>% forecast(h = "2 years")
ets_for %>% autoplot(test_set) +
xlab("Time") +
ylab("Turnover") +
ggtitle("Forecasting result of ETS model")
arima_for %>% autoplot(test_set)+
xlab("Time") +
ylab("Turnover") +
ggtitle("Forecasting result of ARIMA model")
ets_for %>% accuracy(test_set)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA Test 1.05 10.4 8.44 0.140 1.88 NaN NaN 0.404
arima_for %>% accuracy(test_set)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test 9.64 14.2 12.0 2.06 2.62 NaN NaN 0.447
From the forecast plots the results look fairly similar, and we cannot tell which model is better from the plots alone.
From the accuracy statistics, however, it is quite clear that the ETS(A,A,A) forecasts are more accurate, since that model has smaller RMSE, MAE, MPE, and every other statistic.
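For reference, the accuracy statistics being compared can be computed from scratch; the actual/forecast vectors here are made up for illustration, not the report's numbers:

```r
actual   <- c(100, 110, 120, 130)
forecast <- c(102, 108, 125, 128)
e <- actual - forecast  # forecast errors

ME   <- mean(e)                       # mean error (bias)
RMSE <- sqrt(mean(e^2))               # root mean squared error
MAE  <- mean(abs(e))                  # mean absolute error
MAPE <- mean(abs(e / actual)) * 100   # mean absolute percentage error

c(ME = ME, RMSE = RMSE, MAE = MAE, MAPE = MAPE)
```

RMSE penalises large errors more heavily than MAE, which is why the two can rank models differently.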
ETS_AAA %>% augment() %>%
gg_tsdisplay(.resid, "hist") +
ggtitle("Residual Diagnostic of ETS model")
ARIMA %>% augment() %>%
gg_tsdisplay(.resid, "hist")+
ggtitle("Residual Diagnostic of ARIMA model")
Box.test(augment(ETS_AAA)$.resid, fitdf = 16, lag = 24, type = "Ljung")
##
## Box-Ljung test
##
## data: augment(ETS_AAA)$.resid
## X-squared = 93.564, df = 8, p-value < 2.2e-16
Box.test(augment(ARIMA)$.resid, fitdf = 6 , lag = 24, type = "Ljung")
##
## Box-Ljung test
##
## data: augment(ARIMA)$.resid
## X-squared = 44.876, df = 18, p-value = 0.0004318
The final step in determining the best model is residual diagnostics.
Unfortunately, the residuals of the ETS(A,A,A) model do not behave much like white noise: the ACF plot shows spikes close to, or breaking, the bounds at lags 8, 14, 15, and 18.
The ARIMA model has better-behaved residuals: apart from lag 14, no spike breaks the limit, and that single spike may simply reflect the 5% false-alarm rate. Its residual histogram is also more normal-looking than the ETS one.
So I pick the ARIMA(2,1,1)(2,1,1)[12] model as the best one.
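The Ljung-Box setup used above deserves a note: with `fitdf` estimated parameters, the test uses `lag - fitdf` degrees of freedom, which is why the ARIMA output reports df = 18. A quick sanity check on simulated white noise (an illustration, not the model's residuals):

```r
# On pure white noise the Ljung-Box statistic should be unremarkable.
set.seed(1)
wn <- rnorm(400)
bt <- Box.test(wn, lag = 24, fitdf = 6, type = "Ljung")

bt$parameter  # df = 24 - 6 = 18, matching the ARIMA diagnostic above
bt$p.value    # typically large for white noise
```

Both fitted models here reject the white-noise null, so neither residual series is ideal; the ARIMA residuals are simply the less objectionable of the two.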
Now I will use all the data I have to estimate an ARIMA(2,1,1)(2,1,1)[12] model and forecast the next two years of Western Australia's food retail turnover, then compare the forecasts with the real ABS data, which currently extends only to March 2019.
for_19 <- my_data %>% model(
arima = ARIMA(box_cox(Turnover,0.3) ~ pdq(2,1,1) + PDQ(2,1,1))
) %>% forecast(h = "2 years")
for_21 <- my_data %>% model(
arima = ARIMA(box_cox(Turnover, 0.3) ~ pdq(2,1,1) + PDQ(2,1,1))
) %>% forecast(h = "4 years")
for_19 %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Two Years Forecasting of ARIMA(2,1,1)(2,1,1)[12], compared with true values")
for_21 %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Four Years Forecasting of ARIMA(2,1,1)(2,1,1)[12], compared with true values")
#for_19 %>% mutate(interavl = hilo(.mean, 80))
Comparing the forecasts with the true data, the result is quite good.
Most of the forecast values fall within the 80% forecast interval, and all of them within the 95% interval. The basic pattern of the forecasts is correct.
for_19_ets <- my_data %>% model(
arima = ETS(box_cox(Turnover,0.3) ~ error("A") + trend("A") + season("A"))
) %>% forecast(h = 24)
for_21_ets <- my_data %>% model(
arima = ETS(box_cox(Turnover, 0.3) ~ error("A") + trend("A") + season("A"))
) %>% forecast(h = 48)
for_19_ets %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Two Years Forecasting of ETS(AAA), compared with true values")
for_21_ets %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Four Years Forecasting of ETS(AAA), compared with true values")
#for_19_ets %>% mutate(interavl = hilo(.distribution, 80))
for_19_auto <- my_data %>% model(
arima = ARIMA(box_cox(Turnover,0.3) ~ pdq(1,0,1) + PDQ(0,1,1))
) %>% forecast(h = 24)
for_21_auto <- my_data %>% model(
arima = ARIMA(box_cox(Turnover, 0.3) ~ pdq(1,0,1) + PDQ(0, 1,1))
) %>% forecast(h = 48)
for_19_auto %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Two Years Forecasting of ARIMA(1,0,1)(0,1,1)[12], compared with true values")
for_21_auto %>% autoplot(latest_data)+
xlab("Time")+
ylab("Turnover") +
ggtitle("Four Years Forecasting of ARIMA(1,0,1)(0,1,1)[12], compared with true values")
#for_19_auto %>% mutate(interavl = hilo(.distribution, 80))
Out of personal interest, I also use the ETS(A,A,A) model and the automatically selected ARIMA(1,0,1)(0,1,1)[12] model to forecast the same period.
Most of the ETS(A,A,A) forecasts do not fall within the 80% interval, and some miss even the 95% interval, indicating that this model is not very helpful for forecasting here.
But the automatically selected ARIMA(1,0,1)(0,1,1)[12] outperforms my ARIMA(2,1,1)(2,1,1)[12], tracking reality almost exactly: all its forecasts fall within the 80% interval, despite the fact that it was not the best model on the test set.
for_19 %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -34.7 36.7 34.7 -7.37 7.37 NaN NaN 0.580
for_19_ets %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -44.6 47.1 44.6 -9.50 9.50 NaN NaN 0.670
for_19_auto %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -41.6 43.9 41.6 -8.82 8.82 NaN NaN 0.675
for_21 %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -31.9 55.5 40.6 -7.69 9.04 NaN NaN 0.729
for_21_ets %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -47.8 67.2 52.5 -10.9 11.6 NaN NaN 0.739
for_21_auto %>% accuracy(latest_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -44.0 62.5 48.6 -10.1 10.7 NaN NaN 0.723
Comparing all the accuracy statistics, the automatically selected model, unsurprisingly, has the best performance.
Overall, the final winner performs well:
The forecasts capture the correct pattern.
The forecast intervals are reasonable and useful.
However, there are still several points that could be improved or modified:
The model involves many parameters: fitting the final ARIMA model requires estimating six coefficients.
The model uses two orders of differencing, and each difference widens the forecast interval; from the plots, the two-difference intervals are somewhat wider than the single-difference ones.
Compared with the automatic model, the ARIMA(2,1,1)(2,1,1)[12] is still a bit disappointing. This suggests that different data sets favour different models: the best model on the training and test sets may not be the best one for the real forecasting task.
Due to the nature of the ARIMA model, the forecast interval grows as the horizon extends, which means this model is not practical for long-term forecasting.
From the STL decomposition, I found that the data changes significantly in more recent years, suggesting a structural break. So in practice we might fit different models to different periods, or fit the model on recent data only.
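The interval-widening point above can be sketched with base `stats::arima`; the airline-style ARIMA(0,1,1)(0,1,1)[12] on `log(AirPassengers)` is an illustrative stand-in for the report's model and data:

```r
# Fit an ARIMA with one ordinary and one seasonal difference, then forecast.
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
pr <- predict(fit, n.ahead = 24)

pr$se[1]   # one-step-ahead forecast standard error
pr$se[24]  # two-years-ahead: noticeably larger
```

With d = D = 1 the forecast variance accumulates at every step, so the interval keeps widening; a model with fewer differences (like the automatically selected one here, with d = 0) widens more slowly.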